Lab 6

Author

Caitlin Rasbid

Published

April 4, 2025

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidymodels)
── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
✔ broom        1.0.7     ✔ rsample      1.2.1
✔ dials        1.4.0     ✔ tune         1.3.0
✔ infer        1.0.7     ✔ workflows    1.2.0
✔ modeldata    1.4.0     ✔ workflowsets 1.1.0
✔ parsnip      1.3.0     ✔ yardstick    1.3.2
✔ recipes      1.1.1     
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr::lag()      masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step()   masks stats::step()
• Use tidymodels_prefer() to resolve common conflicts.
library(powerjoin)
library(glue)
library(vip)

Attaching package: 'vip'

The following object is masked from 'package:utils':

    vi
library(baguette)
root  <- 'https://gdex.ucar.edu/dataset/camels/file'

dir.create('data', showWarnings = FALSE)  # make sure the data/ folder exists

download.file('https://gdex.ucar.edu/dataset/camels/file/camels_attributes_v2.0.pdf', 
              'data/camels_attributes_v2.0.pdf')
types <- c("clim", "geol", "soil", "topo", "vege", "hydro")

remote_files  <- glue('{root}/camels_{types}.txt')

local_files   <- glue('data/camels_{types}.txt')

walk2(remote_files, local_files, download.file, quiet = TRUE)

camels <- map(local_files, read_delim, show_col_types = FALSE) 

camels <- power_full_join(camels, by = 'gauge_id')
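For readers less familiar with glue() and walk2(), the remote/local pairing above can be sketched in base R with sprintf(); walk2() then just calls download.file() once per pair.

```r
# Base-R sketch of the path pairing that glue() builds above.
root  <- "https://gdex.ucar.edu/dataset/camels/file"
types <- c("clim", "geol", "soil", "topo", "vege", "hydro")

remote <- sprintf("%s/camels_%s.txt", root, types)  # one URL per type
local  <- sprintf("data/camels_%s.txt", types)      # matching local path

# walk2(remote, local, download.file) is equivalent to:
# for (i in seq_along(remote)) download.file(remote[i], local[i])
remote[1]
#> [1] "https://gdex.ucar.edu/dataset/camels/file/camels_clim.txt"
```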

Question 1

The zero_q_freq variable is the frequency of days with Q = 0 mm/day, expressed as a percentage, where Q is discharge: a measure of the volume of water flowing past a point on a stream.
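As a small illustration (with made-up numbers, not CAMELS data), a zero-flow frequency like zero_q_freq can be computed from a daily discharge series as the percentage of zero-flow days:

```r
# Hypothetical ten-day discharge record (mm/day).
q_daily <- c(0, 0.3, 0, 1.2, 0.8, 0, 2.1, 0.5, 0, 0.9)

# Fraction of days with Q = 0, expressed as a percentage.
zero_q_freq <- 100 * mean(q_daily == 0)
zero_q_freq
#> [1] 40
```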

Exploratory Data Analysis

ggplot(data = camels, aes(x = gauge_lon, y = gauge_lat)) +
  borders("state", colour = "gray50") +
  geom_point(aes(color = q_mean)) +
  scale_color_gradient(low = "pink", high = "dodgerblue") +
  ggthemes::theme_map()

Question 2

camels_long <- camels %>%
  pivot_longer(cols = c(aridity, p_mean), 
               names_to = "variable", 
               values_to = "value")

ggplot(data = camels_long, aes(x = gauge_lon, y = gauge_lat)) +
  borders("state", colour = "gray50") +
  geom_point(aes(color = value)) +  
  scale_color_gradient(low = "pink", high = "dodgerblue") +
  ggthemes::theme_map() +
  facet_wrap(~ variable, scales = "free") + 
  labs(
    title = "Maps of Aridity and P-Mean",  
    x = "Longitude", 
    y = "Latitude",  
    color = "Measurement Value"  
  )

Model Preparation & Building

camels |> 
  select(aridity, p_mean, q_mean) |> 
  drop_na() |> 
  cor()
           aridity     p_mean     q_mean
aridity  1.0000000 -0.7550090 -0.5817771
p_mean  -0.7550090  1.0000000  0.8865757
q_mean  -0.5817771  0.8865757  1.0000000
ggplot(camels, aes(x = aridity, y = p_mean)) +
  geom_point(aes(color = q_mean)) +
  geom_smooth(method = "lm", color = "red", linetype = 2) +
  scale_color_viridis_c() +
  theme_linedraw() + 
  theme(legend.position = "bottom") + 
  labs(title = "Aridity vs Rainfall vs Runoff", 
       x = "Aridity", 
       y = "Rainfall",
       color = "Mean Flow")
`geom_smooth()` using formula = 'y ~ x'

ggplot(camels, aes(x = aridity, y = p_mean)) +
  geom_point(aes(color = q_mean)) +
  geom_smooth(method = "lm") +
  scale_color_viridis_c() +
  scale_x_log10() + 
  scale_y_log10() +
  theme_linedraw() +
  theme(legend.position = "bottom") + 
  labs(title = "Aridity vs Rainfall vs Runoff", 
       x = "Aridity", 
       y = "Rainfall",
       color = "Mean Flow")
`geom_smooth()` using formula = 'y ~ x'

ggplot(camels, aes(x = aridity, y = p_mean)) +
  geom_point(aes(color = q_mean)) +
  geom_smooth(method = "lm") +
  scale_color_viridis_c(trans = "log") +
  scale_x_log10() + 
  scale_y_log10() +
  theme_linedraw() +
  theme(legend.position = "bottom",
        legend.key.width = unit(2.5, "cm"),
        legend.key.height = unit(.5, "cm")) + 
  labs(title = "Aridity vs Rainfall vs Runoff", 
       x = "Aridity", 
       y = "Rainfall",
       color = "Mean Flow") 
`geom_smooth()` using formula = 'y ~ x'

set.seed(123)

camels <- camels |> 
  mutate(logQmean = log(q_mean))

camels_split <- initial_split(camels, prop = 0.8)
camels_train <- training(camels_split)
camels_test  <- testing(camels_split)

camels_cv <- vfold_cv(camels_train, v = 10)
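The reason for modeling log(q_mean) rather than q_mean is that streamflow is strongly right-skewed; a toy example (hypothetical values, not CAMELS data) shows how the log transform pulls the mean back toward the median:

```r
# Hypothetical right-skewed "flows": a few large values dominate the mean.
q <- c(0.2, 0.5, 0.8, 1.1, 6, 25)

mean(q) - median(q)            # large gap on the raw scale
mean(log(q)) - median(log(q))  # much smaller gap after the log transform
```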
rec <-  recipe(logQmean ~ aridity + p_mean, data = camels_train) %>%
  step_log(all_predictors()) %>%
  step_interact(terms = ~ aridity:p_mean) %>%
  step_naomit(all_predictors(), all_outcomes())
baked_data <- prep(rec, camels_train) |> 
  bake(new_data = NULL)
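Concretely, on a tiny hypothetical data frame, the recipe's step_log() and step_interact() steps amount to logging each predictor and then multiplying the logged columns into a new aridity_x_p_mean term (prep()/bake() do the real work above; this is just an illustration of the arithmetic, including the name recipes gives the interaction column):

```r
# Toy data standing in for the two predictors.
toy <- data.frame(aridity = c(1, 2, 4), p_mean = c(2, 4, 8))

# step_log(all_predictors()): natural log of each column.
logged <- log(toy)

# step_interact(~ aridity:p_mean): product of the (already logged)
# columns, stored under the name aridity_x_p_mean.
logged$aridity_x_p_mean <- logged$aridity * logged$p_mean
logged$aridity_x_p_mean
```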

lm_base <- lm(logQmean ~ aridity * p_mean, data = baked_data)
summary(lm_base)

Call:
lm(formula = logQmean ~ aridity * p_mean, data = baked_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.91162 -0.21601 -0.00716  0.21230  2.85706 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)    -1.77586    0.16365 -10.852  < 2e-16 ***
aridity        -0.88397    0.16145  -5.475 6.75e-08 ***
p_mean          1.48438    0.15511   9.570  < 2e-16 ***
aridity:p_mean  0.10484    0.07198   1.457    0.146    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5696 on 531 degrees of freedom
Multiple R-squared:  0.7697,    Adjusted R-squared:  0.7684 
F-statistic: 591.6 on 3 and 531 DF,  p-value: < 2.2e-16
summary(lm(logQmean ~ aridity + p_mean + aridity_x_p_mean, data = baked_data))

Call:
lm(formula = logQmean ~ aridity + p_mean + aridity_x_p_mean, 
    data = baked_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.91162 -0.21601 -0.00716  0.21230  2.85706 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)      -1.77586    0.16365 -10.852  < 2e-16 ***
aridity          -0.88397    0.16145  -5.475 6.75e-08 ***
p_mean            1.48438    0.15511   9.570  < 2e-16 ***
aridity_x_p_mean  0.10484    0.07198   1.457    0.146    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5696 on 531 degrees of freedom
Multiple R-squared:  0.7697,    Adjusted R-squared:  0.7684 
F-statistic: 591.6 on 3 and 531 DF,  p-value: < 2.2e-16
test_data <-  bake(prep(rec), new_data = camels_test)
test_data$lm_pred <- predict(lm_base, newdata = test_data)

metrics(test_data, truth = logQmean, estimate = lm_pred)
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       0.583
2 rsq     standard       0.742
3 mae     standard       0.390
ggplot(test_data, aes(x = logQmean, y = lm_pred, colour = aridity)) +
  scale_color_gradient2(low = "brown", mid = "orange", high = "darkgreen") +
  geom_point() +
  geom_abline(linetype = 2) +
  theme_linedraw() + 
  labs(title = "Linear Model: Observed vs Predicted",
       x = "Observed Log Mean Flow",
       y = "Predicted Log Mean Flow",
       color = "Aridity")

lm_model <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")

lm_wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(lm_model) %>%
  fit(data = camels_train) 


summary(extract_fit_engine(lm_wf))$coefficients
                   Estimate Std. Error    t value     Pr(>|t|)
(Intercept)      -1.7758557 0.16364755 -10.851710 6.463654e-25
aridity          -0.8839738 0.16144589  -5.475357 6.745512e-08
p_mean            1.4843771 0.15511117   9.569762 4.022500e-20
aridity_x_p_mean  0.1048449 0.07198145   1.456555 1.458304e-01
summary(lm_base)$coefficients
                 Estimate Std. Error    t value     Pr(>|t|)
(Intercept)    -1.7758557 0.16364755 -10.851710 6.463654e-25
aridity        -0.8839738 0.16144589  -5.475357 6.745512e-08
p_mean          1.4843771 0.15511117   9.569762 4.022500e-20
aridity:p_mean  0.1048449 0.07198145   1.456555 1.458304e-01
lm_data <- augment(lm_wf, new_data = camels_test)
dim(lm_data)
[1] 135  61
metrics(lm_data, truth = logQmean, estimate = .pred)
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       0.583
2 rsq     standard       0.742
3 mae     standard       0.390
ggplot(lm_data, aes(x = logQmean, y = .pred, colour = aridity)) +
  scale_color_viridis_c() +
  geom_point() +
  geom_abline() +
  theme_linedraw()

rf_model <- rand_forest() %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("regression")

rf_wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(rf_model) %>%
  fit(data = camels_train) 

rf_data <- augment(rf_wf, new_data = camels_test)
dim(rf_data)
[1] 135  60
metrics(rf_data, truth = logQmean, estimate = .pred)
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       0.587
2 rsq     standard       0.741
3 mae     standard       0.363
ggplot(rf_data, aes(x = logQmean, y = .pred, colour = aridity)) +
  scale_color_viridis_c() +
  geom_point() +
  geom_abline() +
  theme_linedraw()

wf <- workflow_set(list(rec), list(lm_model, rf_model)) %>%
  workflow_map('fit_resamples', resamples = camels_cv) 

autoplot(wf)

rank_results(wf, rank_metric = "rsq", select_best = TRUE)
# A tibble: 4 × 9
  wflow_id          .config .metric  mean std_err     n preprocessor model  rank
  <chr>             <chr>   <chr>   <dbl>   <dbl> <int> <chr>        <chr> <int>
1 recipe_rand_fore… Prepro… rmse    0.563  0.0247    10 recipe       rand…     1
2 recipe_rand_fore… Prepro… rsq     0.771  0.0259    10 recipe       rand…     1
3 recipe_linear_reg Prepro… rmse    0.569  0.0260    10 recipe       line…     2
4 recipe_linear_reg Prepro… rsq     0.770  0.0223    10 recipe       line…     2

Question 3

# Boosted Tree (XGBoost)
bt_model <- boost_tree() %>%
  set_engine("xgboost") %>%
  set_mode("regression")

bt_wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(bt_model) %>%
  fit(data = camels_train) 

bt_data <- augment(bt_wf, new_data = camels_test)
dim(bt_data)
[1] 135  60
metrics(bt_data, truth = logQmean, estimate = .pred)
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       0.631
2 rsq     standard       0.702
3 mae     standard       0.397
ggplot(bt_data, aes(x = logQmean, y = .pred, colour = aridity)) +
  scale_color_viridis_c() +
  geom_point() +
  geom_abline() +
  theme_linedraw()

# Neural Network
nnet_model <- bag_mlp() %>%
  set_engine("nnet") %>%
  set_mode("regression")

nnet_wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(nnet_model) %>%
  fit(data = camels_train) 

nnet_data <- augment(nnet_wf, new_data = camels_test)
dim(nnet_data)
[1] 135  61
metrics(nnet_data, truth = logQmean, estimate = .pred)
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       0.545
2 rsq     standard       0.772
3 mae     standard       0.337
ggplot(nnet_data, aes(x = logQmean, y = .pred, colour = aridity)) +
  scale_color_viridis_c() +
  geom_point() +
  geom_abline() +
  theme_linedraw()

# Add to existing workflow
wf <- workflow_set(list(rec), list(lm_model, rf_model, nnet_model, bt_model)) %>%
  workflow_map('fit_resamples', resamples = camels_cv) 

autoplot(wf)

rank_results(wf, rank_metric = "rsq", select_best = TRUE)
# A tibble: 8 × 9
  wflow_id          .config .metric  mean std_err     n preprocessor model  rank
  <chr>             <chr>   <chr>   <dbl>   <dbl> <int> <chr>        <chr> <int>
1 recipe_bag_mlp    Prepro… rmse    0.547  0.0282    10 recipe       bag_…     1
2 recipe_bag_mlp    Prepro… rsq     0.786  0.0234    10 recipe       bag_…     1
3 recipe_rand_fore… Prepro… rmse    0.564  0.0260    10 recipe       rand…     2
4 recipe_rand_fore… Prepro… rsq     0.770  0.0266    10 recipe       rand…     2
5 recipe_linear_reg Prepro… rmse    0.569  0.0260    10 recipe       line…     3
6 recipe_linear_reg Prepro… rsq     0.770  0.0223    10 recipe       line…     3
7 recipe_boost_tree Prepro… rmse    0.600  0.0289    10 recipe       boos…     4
8 recipe_boost_tree Prepro… rsq     0.745  0.0268    10 recipe       boos…     4

Of the four models, the bagged neural network performs best and the XGBoost model performs worst. The linear model still ranks above XGBoost, possibly because the boosted tree could not capture enough structure in this relatively small dataset. However, the linear model was outperformed by both the random forest and the neural network, suggesting a simple linear relationship is not sufficient to explain the interactions between terms. The flexibility of the bagged neural network and its ability to learn complex, nonlinear interactions make it the best model, as shown in the table and plot above. For these reasons I would continue with the neural network model.

Build Your Own

#Data Splitting
set.seed(124)

camels <- camels %>%
  mutate(logQmean = log(q_mean)) %>%
  select(logQmean, p_mean, aridity, soil_depth_pelletier, max_water_content,
         organic_frac, frac_snow, pet_mean, soil_depth_statsgo, elev_mean) %>%
  drop_na()

c_split <- initial_split(camels, prop = 0.8)
c_train <- training(c_split)
c_test  <- testing(c_split)

c_cv <- vfold_cv(c_train, v = 10)
# Recipe
rec2 <-  recipe(logQmean ~ . , data = c_train) %>%
  step_scale(all_predictors()) %>%
  step_center(all_predictors()) 

I chose this formula because it models the log-transformed mean flow as a function of all the selected predictors. The log transformation of q_mean produced a more predictable relationship with p_mean and aridity earlier in the lab, so I kept it here. Since the many predictors are on very different scales, I used step_scale() and step_center() to standardize them, putting everything on a common scale for more accurate model fitting.
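As a quick sanity check on the preprocessing order, dividing by the standard deviation first (step_scale) and then subtracting the mean (step_center) still leaves each column standardized, with mean 0 and standard deviation 1:

```r
# Minimal base-R sketch of step_scale() followed by step_center().
x <- c(3, 7, 11, 15)

scaled   <- x / sd(x)              # step_scale(): divide by the sample sd
centered <- scaled - mean(scaled)  # step_center(): subtract the mean

round(c(mean = mean(centered), sd = sd(centered)), 10)
```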

c_baked <- prep(rec2, c_train) |> 
  bake(new_data = NULL)

lm_base2 <- lm(logQmean ~ . , data = c_baked)
summary(lm_base2)

Call:
lm(formula = logQmean ~ ., data = c_baked)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.7726 -0.1543  0.0502  0.2400  3.9490 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)          -0.081749   0.022536  -3.627 0.000314 ***
p_mean                0.532187   0.039029  13.636  < 2e-16 ***
aridity              -0.643846   0.049819 -12.924  < 2e-16 ***
soil_depth_pelletier -0.077194   0.029114  -2.651 0.008256 ** 
max_water_content     0.058350   0.043250   1.349 0.177877    
organic_frac          0.005265   0.025468   0.207 0.836304    
frac_snow             0.287464   0.051639   5.567 4.14e-08 ***
pet_mean              0.137612   0.028827   4.774 2.35e-06 ***
soil_depth_statsgo   -0.156120   0.039232  -3.979 7.88e-05 ***
elev_mean             0.034353   0.057964   0.593 0.553668    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5218 on 526 degrees of freedom
Multiple R-squared:  0.8012,    Adjusted R-squared:  0.7978 
F-statistic: 235.5 on 9 and 526 DF,  p-value: < 2.2e-16
# Linear Model
lm_model2 <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")

lm_wf2 <- workflow() %>%
  add_recipe(rec2) %>%
  add_model(lm_model2) %>%
  fit(data = c_train)
summary(extract_fit_engine(lm_wf2))$coefficients
                         Estimate Std. Error     t value     Pr(>|t|)
(Intercept)          -0.081748780 0.02253641  -3.6274094 3.141820e-04
p_mean                0.532187271 0.03902852  13.6358550 1.809987e-36
aridity              -0.643845847 0.04981868 -12.9237847 2.239196e-33
soil_depth_pelletier -0.077194364 0.02911388  -2.6514630 8.256109e-03
max_water_content     0.058350264 0.04325039   1.3491270 1.778766e-01
organic_frac          0.005264932 0.02546817   0.2067260 8.363039e-01
frac_snow             0.287464074 0.05163924   5.5667757 4.141554e-08
pet_mean              0.137611570 0.02882695   4.7737118 2.347449e-06
soil_depth_statsgo   -0.156119578 0.03923198  -3.9793961 7.878971e-05
elev_mean             0.034352637 0.05796409   0.5926537 5.536676e-01
lm_data2 <- augment(lm_wf2, new_data = c_test)
metrics(lm_data2, truth = logQmean, estimate = .pred)
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       0.447
2 rsq     standard       0.874
3 mae     standard       0.302
# Random Forest Model
rf_model2 <- rand_forest() %>%
  set_engine("ranger") %>%
  set_mode("regression")

rf_wf2 <- workflow() %>%
  add_recipe(rec2) %>%
  add_model(rf_model2) %>%
  fit(data = c_train)

rf_data2 <- augment(rf_wf2, new_data = c_test)
metrics(rf_data2, truth = logQmean, estimate = .pred)
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       0.290
2 rsq     standard       0.946
3 mae     standard       0.191
# Boosted Tree (XGBoost)
bt_model2 <- boost_tree() %>%
  set_engine("xgboost") %>%
  set_mode("regression")

bt_wf2 <- workflow() %>%
  add_recipe(rec2) %>%
  add_model(bt_model2) %>%
  fit(data = c_train) 

bt_data2 <- augment(bt_wf2, new_data = c_test)

metrics(bt_data2, truth = logQmean, estimate = .pred)
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       0.384
2 rsq     standard       0.903
3 mae     standard       0.241
# Neural Network

nnet_model2 <- bag_mlp() %>%
  set_engine("nnet") %>%
  set_mode("regression")

nnet_wf2 <- workflow() %>%
  add_recipe(rec2) %>%
  add_model(nnet_model2) %>%
  fit(data = c_train) 

nnet_data2 <- augment(nnet_wf2, new_data = c_test)

metrics(nnet_data2, truth = logQmean, estimate = .pred)
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       0.303
2 rsq     standard       0.943
3 mae     standard       0.203
# Workflow Set
wf2 <- workflow_set(list(rec2), list(lm_model2, rf_model2, nnet_model2, bt_model2)) %>%
  workflow_map('fit_resamples', resamples = c_cv) 
# Evaluation
autoplot(wf2)

rank_results(wf2, rank_metric = "rsq", select_best = TRUE)
# A tibble: 8 × 9
  wflow_id          .config .metric  mean std_err     n preprocessor model  rank
  <chr>             <chr>   <chr>   <dbl>   <dbl> <int> <chr>        <chr> <int>
1 recipe_rand_fore… Prepro… rmse    0.389  0.0273    10 recipe       rand…     1
2 recipe_rand_fore… Prepro… rsq     0.896  0.0107    10 recipe       rand…     1
3 recipe_bag_mlp    Prepro… rmse    0.399  0.0331    10 recipe       bag_…     2
4 recipe_bag_mlp    Prepro… rsq     0.888  0.0115    10 recipe       bag_…     2
5 recipe_boost_tree Prepro… rmse    0.409  0.0312    10 recipe       boos…     3
6 recipe_boost_tree Prepro… rsq     0.882  0.0106    10 recipe       boos…     3
7 recipe_linear_reg Prepro… rmse    0.531  0.0335    10 recipe       line…     4
8 recipe_linear_reg Prepro… rsq     0.800  0.0136    10 recipe       line…     4

The random forest and bagged neural network models clearly perform best when predicting logQmean from this set of predictors: the random forest ranks first in cross-validation (mean R² of about 0.90 versus 0.89 for the neural network), and the two are nearly indistinguishable on the test set. As in the previous example, the relationships between the variables are nonlinear, so the linear model is insufficient; the tree-based models outperform it, and the flexible methods that capture complex interactions do best of all. With roughly 94% of the test-set variation in logQmean accounted for by the predictors, I continue with the neural network model below, since its performance is essentially equivalent to the random forest's.

# Extract and Evaluate 
# workflow created in initial model creation above
ggplot(nnet_data2, aes(x = logQmean, y = .pred)) +
  geom_point(color = "blue", alpha = 0.6) +
  geom_abline() +
  labs(
    title = "Observed vs. Predicted Log Mean Streamflow",
    x = "Observed logQmean",
    y = "Predicted logQmean"
  ) +
  theme_minimal()

The model fits well when predicting logQmean from this set of predictor variables. The points cluster around the 1:1 line, which represents perfect agreement between observed and predicted values, though there is some scatter between the two. The model could likely be fine-tuned by including additional predictor variables to increase its predictive capacity. However, no model will be 100% accurate in predicting streamflow on a given day due to daily variability and the presence of outliers, so this model is an excellent starting point for making predictions.